SportsStats Olympic Athletes Analysis

At a glance

Project: Data cleaning and analysis
Date: June 2025
Category: Data analysis

Project overview

This project analyzes over 120 years of Olympic Games data for patterns in performance, participation, and country results. Using Python and SQL, it explores demographics, medals, and how physical attributes vary by sport.

The work highlights how gender participation has changed, which countries lead medal counts, and how height, weight, and age relate to sport—supporting a narrative on Olympic history and athletic excellence.

Dataset / source

Historical Olympic athlete data (e.g., athlete_events.csv with 271K+ rows) and noc_regions.csv for NOC-to-region mapping, spanning 1896–2016.

Tools used

Python, pandas, matplotlib, DuckDB (SQL), Jupyter-style notebook workflow.

What problem you solved

A large, messy athlete-level history needed to be summarized into reliable answers about medals, participation, gender trends, sport-level physical profiles, and country efficiency.

Key insights

Medals
USA, Russia, and Germany lead total medal counts in the cleaned sample (e.g., USA 4,383; Russia 3,610; Germany 3,189).
Gender participation
Male vs female athlete counts over time show substantial growth in women’s participation, especially after the 1980s.
Physical profiles
Sport-level averages differ widely (e.g., basketball vs rhythmic gymnastics age profiles).
Efficiency
Medals-per-athlete metrics highlight efficient smaller programs (e.g., Jamaica at about 1.9 medals per athlete in the sample shown).

Embedded project — notebook excerpt

Best Viewed on Larger Screens

For the best experience viewing this analysis notebook, please use a laptop, desktop computer, or tablet in landscape mode.

SportsStats-Project-file.ipynb

import pandas as pd

import matplotlib.pyplot as plt

# Load your Olympic athlete data
df_olympics = pd.read_csv("athlete_events.csv")
noc_df = pd.read_csv("noc_regions.csv")

df_olympics.info()

<class 'pandas.core.frame.DataFrame'> RangeIndex: 271116 entries, 0 to 271115 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 271116 non-null int64 1 Name 271116 non-null object 2 Sex 271116 non-null object 3 Age 261642 non-null float64 4 Height 210945 non-null float64 5 Weight 208241 non-null float64 6 Team 271116 non-null object 7 NOC 271116 non-null object 8 Games 271116 non-null object 9 Year 271116 non-null int64 10 Season 271116 non-null object 11 City 271116 non-null object 12 Sport 271116 non-null object 13 Event 271116 non-null object 14 Medal 39783 non-null object dtypes: float64(3), int64(2), object(10) memory usage: 20.7+ MB

# 1. Remove rows with null in critical columns
df_olympics = df_olympics.dropna(subset=['Name', 'Sex', 'Age', 'Height', 'Weight', 'NOC', 'Sport', 'Event'])

# 2: Remove duplicate rows
df_olympics = df_olympics.drop_duplicates()

# 3. Convert data types
df_olympics['Age'] = df_olympics['Age'].astype(int)
df_olympics['Height'] = df_olympics['Height'].astype(int)
df_olympics['Weight'] = df_olympics['Weight'].astype(int)

# 4. Join with noc_regions.csv on 'NOC'
df_merged = df_olympics.merge(noc_df, on='NOC', how='left')

# 5. Fill missing region values (if needed)
df_merged['region'] = df_merged['region'].fillna('Unknown')

# Total records
print("Total Records:", df_merged.shape[0])

# Unique countries
print("Total Countries:", df_merged['region'].nunique())

# Top 5 Medal-Winning Countries
print("\nTop Countries by Medal Count:")
print(df_merged[df_merged['Medal'].notna()]['region'].value_counts().head(5))

# Male vs Female participation
print("\nGender Participation:")
print(df_merged['Sex'].value_counts())

Total Records: 206152 Total Countries: 206 Top Countries by Medal Count: region USA 4383 Russia 3610 Germany 3189 Australia 1210 Italy 1060 Name: count, dtype: int64 Gender Participation: Sex M 139441 F 66711 Name: count, dtype: int64

import duckdb

query = """
    SELECT region, COUNT(*) AS total_medals
    FROM 'olympics_cleaned.csv'
    WHERE medal IS NOT NULL
    GROUP BY region
    ORDER BY total_medals DESC
    LIMIT 10
"""

total_medals = duckdb.query(query).df()
print(total_medals)

region total_medals 0 USA 4383 1 Russia 3610 2 Germany 3189 3 Australia 1210 4 Italy 1060 5 Canada 1060 6 UK 1031 7 China 989 8 France 987 9 Japan 843

# Plot Top 10 Countries by Total Medals
plt.figure(figsize=(10,6))
plt.bar(df_medals['region'].head(10), df_medals['total_medals'].head(10), color='steelblue')
plt.title('Top 10 Countries by Total Medals')
plt.ylabel('Total Medals')
plt.xlabel('Country')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

gender_participation = duckdb.query("""
    SELECT 
        year, 
        sex, 
        COUNT(DISTINCT id) AS athlete_count
    FROM 'olympics_cleaned.csv'
    GROUP BY year, sex
    ORDER BY year, sex
""").df()

male_data = gender_participation[gender_participation['Sex'] == 'M']
female_data = gender_participation[gender_participation['Sex'] == 'F']

plt.figure(figsize=(12, 6))
plt.plot(male_data['Year'], male_data['athlete_count'], label='Male', color='blue')
plt.plot(female_data['Year'], female_data['athlete_count'], label='Female', color='red')
plt.title('Male vs Female Athlete Participation Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Athletes')
plt.legend()
plt.show()

query = """
SELECT 
    Sport,
    COUNT(*) AS total_athletes,
    AVG(Age) AS avg_age,
    AVG(Height) AS avg_height,
    AVG(Weight) AS avg_weight
FROM "olympics_cleaned.csv"
WHERE Height IS NOT NULL AND Weight IS NOT NULL AND Age IS NOT NULL
GROUP BY Sport
ORDER BY avg_height DESC
"""

Physical_profiles_by_sport = duckdb.query(query).df()
print(Physical_profiles_by_sport.head(10))

Sport total_athletes avg_age avg_height avg_weight 0 Basketball 3668 25.453926 191.226554 85.801527 1 Volleyball 3277 25.237107 186.978944 78.900214 2 Beach Volleyball 538 29.135688 186.202602 79.089219 3 Water Polo 2719 25.705774 185.036043 84.568224 4 Rowing 7790 25.413607 184.265982 80.168164 5 Handball 3319 26.492618 183.483579 81.503465 6 Baseball 846 26.309693 182.599291 85.717494 7 Tug-Of-War 20 26.500000 182.550000 91.800000 8 Bobsleigh 2205 28.650794 181.608617 89.161451 9 Motorboating 1 27.000000 181.000000 77.000000

query = """
SELECT 
    Sport,
    AVG(Age) AS avg_age
FROM "olympics_cleaned.csv"
WHERE Medal IS NOT NULL
  AND Sport IN ('Rhythmic Gymnastics', 'Equestrianism')
GROUP BY Sport
"""

Avg_Age_by_Sport = duckdb.query(query).df()
print(Avg_Age_by_Sport)

Sport avg_age 0 Rhythmic Gymnastics 18.911290 1 Equestrianism 35.315234

query = """
SELECT 
    region,
    COUNT(DISTINCT ID) AS total_athletes,
    COUNT(Medal) AS total_medals,
    ROUND(COUNT(Medal) * 1.0 / COUNT(DISTINCT ID), 2) AS medal_efficiency
FROM 'olympics_cleaned.csv'
WHERE Medal IS NOT NULL
GROUP BY region
HAVING total_athletes > 0
ORDER BY medal_efficiency DESC
"""

medal_efficiency_per_athletes = duckdb.query(query).df()
print(medal_efficiency_per_athletes.head(10))

region total_athletes total_medals medal_efficiency 0 Namibia 1 4 4.0 1 Kuwait 1 2 2.0 2 Liechtenstein 4 8 2.0 3 Grenada 1 2 2.0 4 Mozambique 1 2 2.0 5 Suriname 1 2 2.0 6 Afghanistan 1 2 2.0 7 Ecuador 1 2 2.0 8 Costa Rica 2 4 2.0 9 Jamaica 81 154 1.9

Embedded project — presentation (PDF)

Best Viewed on Larger Screens

For the best experience viewing this presentation, please use a laptop, desktop computer, or tablet in landscape mode.

SportsStats Olympic Athletes Analysis Presentation

Explore More Projects

Discover other data analysis projects and interactive dashboards

View All Projects